Published 06/29/2011
(Written by Miriam Boon, and featured in iSGTW.)
Did you know? Flops? Iops? For the uninitiated, that's floating point operations per second and in/out operations per second. A machine that can do lots of flops is completing calculations quickly. A machine that can do lots of iops, on the other hand, can read and write a lot of data really quickly - an invaluable characteristic in this day of data-intensive science.
iSGTW: Between existing and planned machines, the San Diego Supercomputer Center has what may be the world's largest collection of flash-based machines. Are the SDSC machines intended to be on a spectrum, with standard hard drive HPC systems at one end, and a machine that is entirely solid state at the other end? Or are the various machines different mixes designed with different purposes in mind?
Richard Moore, SDSC deputy director: First, our systems are user-driven, not technology-driven, so each of our systems is targeted to a specific set of users and their applications.
When we wrote the Gordon proposal in 2008 in response to the National Science Foundation's request for a data-intensive system, we highlighted the terrific potential for flash storage to cost-effectively fill the latency gap in the memory hierarchy between DRAM and spinning disk. Gordon was designed with enough flash to park very large datasets on it and mine those datasets without resorting to spinning disk. Dash is just a small prototype of Gordon. With Trestles, we replaced spinning disk with flash for local node storage (before Steve Jobs did it with the new MacBook Air) - it's used for fast local scratch and checkpointing long-running jobs.
Also, until the cost differential of flash and spinning disk narrows, systems still need spinning disk in the memory hierarchy. SDSC will soon deploy Data Oasis with more than 2PB of spinning disk for current HPC systems, and we have plans to expand this to 20PB over the coming years to support Gordon and future HPC systems.
Dash, pictured here, is an element of the Triton Resource, an integrated data-intensive resource primarily designed to support UC San Diego and UC researchers. Image courtesy of Ben Tolo, San Diego Supercomputer Center, UC San Diego.
iSGTW: In the past, HPC systems have been measured by how many flops they can do; the storage has not typically drawn comment. I understand that flash-based systems are designed to excel at iops. Is there anything else that they excel at?
Allan Snavely, SDSC associate director: In addition to IOPS, we have a metric called DMC (Data Moving Capacity). Take the capacity of every level of the memory hierarchy in bytes (registers, cache, memory, flash, disk), divide by the latency in seconds to access that level, and then sum up all of these terms for a system. That is a metric that represents how much data the system can hold and how fast it can get to it. By this metric, Gordon is nearly as big and capable as the Kraken and Ranger systems, which cost substantially more. So we shine on data-handling ability per dollar.
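A minimal sketch of the DMC sum described above; the capacities and latencies are placeholder orders of magnitude, not the actual specifications of Gordon, Kraken, or Ranger:

```python
# Illustrative Data Moving Capacity (DMC) calculation: for each level of the
# memory hierarchy, divide its capacity (bytes) by its access latency (seconds),
# then sum the terms. The figures below are placeholder orders of magnitude,
# not real system specifications.
LEVELS = {
    # level:      (capacity_bytes, latency_seconds)
    "registers": (4 * 2**10,   1e-9),   # ~4 KB,   ~1 ns
    "cache":     (8 * 2**20,   1e-8),   # ~8 MB,   ~10 ns
    "DRAM":      (64 * 2**30,  1e-7),   # ~64 GB,  ~100 ns
    "flash":     (4 * 2**40,   1e-4),   # ~4 TB,   ~100 us
    "disk":      (100 * 2**40, 1e-2),   # ~100 TB, ~10 ms
}

def dmc(levels):
    """Sum of capacity/latency over all hierarchy levels, in bytes per second."""
    return sum(capacity / latency for capacity, latency in levels.values())

for name, (capacity, latency) in LEVELS.items():
    print(f"{name:9s}: {capacity / latency:.3e} bytes/s")
print(f"DMC total: {dmc(LEVELS):.3e} bytes/s")
```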
iSGTW: So, given the expense of solid state drives, it isn't that a solid state system can't excel at flops - it's that creating a top-tier flops machine AND using all solid state drives would be extremely expensive?
Snavely: That is correct. Flash is currently much more expensive than disk - about $4/GB as opposed to only $0.25/GB for disk. On the other hand, flash is 10x to 100x faster on random I/O and uses about 10x less energy when you are not accessing it.
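At the prices quoted above, the cost gap works out to a factor of 16 per byte. A quick back-of-the-envelope comparison (the 100 TB capacity is an arbitrary example, not a real system's size):

```python
# Back-of-the-envelope cost comparison at the quoted prices ($4/GB flash,
# $0.25/GB disk). The 100 TB capacity is an arbitrary illustrative figure.
FLASH_USD_PER_GB = 4.00
DISK_USD_PER_GB = 0.25

capacity_gb = 100 * 1024  # 100 TB expressed in GB
flash_cost = capacity_gb * FLASH_USD_PER_GB
disk_cost = capacity_gb * DISK_USD_PER_GB
print(f"flash: ${flash_cost:,.0f}  disk: ${disk_cost:,.0f}  "
      f"ratio: {flash_cost / disk_cost:.0f}x")  # ratio is 16x at these prices
```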
iSGTW: Do you expect that to change anytime soon?
Moore: We expect the tremendous growth in demand for flash in consumer products - cameras, PDAs, iPods, iPads, laptops - will narrow the cost differential between flash and disk over time, and flash will be increasingly affordable for HPC systems.
iSGTW: Are there computational scientific problems that are currently seriously hampered by the in/out speed of the drives on traditional HPC systems? Likewise, are there computational scientific problems that do require a top-tier machine in both flops and iops? Can you give examples?
Snavely: A good example of the former is identifying patterns in social network data - this is hard or impossible to do on existing machines due to inadequate memory and disk that is too slow for random access patterns. A good example of the latter would be a coupled calculation - for example, a simulation of a disaster (earthquake or tsunami) that required coupling a model of people's behavior (high iops) to a computational model of the destruction (high flops).
iSGTW: Some novel systems require a complete re-write of the code. To use a flash-based HPC system, will users need to make many changes to their code?
Snavely: It depends. It is easy to do standard I/O to a flash file system without changes to the code. In other cases, one will want to modify the code to take better advantage of flash. For example, we are writing out-of-core solvers to do social network analysis and that means rewriting the algorithm.
An out-of-core solver handles a problem too big to fit into the memory of a computer, so it has to be solved on disk. This requires frequent accesses to disk, which is slower. Think of the following analogy - there are problems you can solve just off the top of your head (they fit in your "memory") and there are problems you need pencil and paper to solve - these take longer. It's similar for computers.
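As a rough illustration of the out-of-core idea (not the social-network solver mentioned above), the sketch below streams a dataset that is assumed to be too large for RAM through memory in fixed-size chunks read from disk:

```python
# Out-of-core sketch: compute a sum over a dataset assumed to be too large to
# load into memory at once, by streaming fixed-size chunks from disk.
# This illustrates the general idea only; it is not the social-network
# analysis code discussed in the interview.
import numpy as np

def out_of_core_sum(path, dtype=np.float64, chunk_elems=1_000_000):
    """Sum a large binary array stored at `path` without loading it fully."""
    data = np.memmap(path, dtype=dtype, mode="r")  # file-backed array, paged in lazily
    total = 0.0
    for start in range(0, data.shape[0], chunk_elems):
        total += data[start:start + chunk_elems].sum()  # touch one chunk at a time
    return total

if __name__ == "__main__":
    # Small demo file; a real out-of-core dataset would be far larger than RAM.
    np.arange(10_000_000, dtype=np.float64).tofile("demo.bin")
    print(out_of_core_sum("demo.bin"))  # 49999995000000.0
```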
iSGTW: Flash drives can play a role in grid computing too. Shawn, can you tell us a little about the Open Science Grid site in which you installed solid state drives? How about you, Rob?
Rob Gardner, University of Chicago, and OSG integration coordinator: We are using SSD drives for a small number of servers at UChicago - mainly for services requiring high performance databases. I am not aware of any wide-scale use of these, for example in worker nodes in a large cluster, or for "larger" filesystems.
Shawn McKee, University of Michigan: They were installed at the ATLAS Great Lakes Tier-2 (AGLT2) site. This USATLAS Tier-2 is physically located at the University of Michigan and Michigan State University. I am the director of this Tier-2 and its primary role is to support the Large Hadron Collider's ATLAS experiment by running production tasks as well as user and group analysis jobs.
We have approximately 4,400 job slots and 1.9 petabytes of dCache storage at AGLT2. In fact, we are one of the most productive (in terms of normalized CPU-hours) Tier-2 sites for all of LHC. We have also been supporting OSG virtual organizations including 'osg,' 'hcc,' 'ligo,' and 'glow.'
iSGTW: What motivated you to explore the use of solid state drives?
McKee: The primary motivation was observed bottlenecks in our infrastructure. Our SSD deployments (so far) have been on site service machines, primarily those backing our dCache and NFS services.
iSGTW: How long have the flash drives been up and running?
McKee: We have had SSDs behind our dCache services since the summer of 2010. The NFS-related SSDs are still being tested in various configurations. Current SSDs are SATA 3 Gbps models (50 GB and 100 GB; we have two of each size in production in four server locations). We recently purchased two 140 GB SATA 6 Gbps SSDs for the NFS server location. Our intent is to use SAS2 SSDs (from either Hitachi or Toshiba) once they are available later this spring.
The SAS2 SSDs will replace the existing SSDs. The new versions have much higher IOPS (the number of I/O operations they can process per second) as well as a factor-of-three improvement in throughput (bandwidth). We will continue to monitor our infrastructure looking for bottlenecks. If we find additional locations that could benefit from SSDs, we will put more into production.
iSGTW: So you've had SSDs of one kind or another up and running for a while now. Have you noticed any changes in performance?
McKee: Prior to deploying SSDs on our dCache instances, nightly maintenance, which includes dumping our Chimera namespace, was taking up to four hours. After we migrated to SSD, this decreased to 45 minutes. Another impact was on our Storage Resource Management (SRM) service. Before the migration to SSD we had a background of intermittent, load-related SRM failures. After the transition to SSD these background failures disappeared.
The number of SRM queries supported (related to storage activities) has significantly improved since we moved the underlying DB onto SSD. Jobs running on worker nodes are not directly affected since the worker nodes are not using SSDs. If the jobs utilize services that were upgraded by migration to SSDs they may see a secondary improvement from the faster response of the underlying service.
iSGTW: What could SSDs mean for grids such as OSG and EGI?
McKee: The ability to service multiple requests effectively can be very important for use on OSG worker nodes. ATLAS analysis jobs are one example.
ATLAS jobs typically stage in their input files to the local worker node's "temp" area. This area is typically shared by all jobs running on that node. As the number of cores increases (as well as logical cores from what used to be called HyperThreading), the number of jobs competing for disk I/O increases. When jobs are very I/O intensive, this can result in the local disk spending much of its time "seeking" between locations on the disk. The result is that having a large number of I/O-intensive jobs can significantly reduce the CPU use of the node, because the jobs are waiting for their I/O requests to complete.
We have seen cases where running a node full of I/O intensive jobs has reduced the effective CPU use to under 10%. Improving the ability of nodes to support competing I/O intensive jobs is an area of great interest for us. SSDs, while potentially very effective at addressing this type of bottleneck, have typically been too expensive to add to every worker node. Other efforts have focused on adding more disks to a worker node to distribute the I/O amongst them, or have explored direct access to local storage nodes (as opposed to making local copies of the files). Our hope is that prices for SSDs will come down soon enough to make it feasible for us to have them on future worker nodes.
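A toy throughput model (with made-up per-task CPU and disk times, not AGLT2 measurements) shows how a single shared spinning disk can drag effective CPU use down to the levels McKee describes:

```python
# Toy model: N job slots share one local spinning disk. Each task needs cpu_s
# seconds of CPU (one core per job) and disk_s seconds of effectively serialized
# random disk I/O. The shared disk caps node-wide task throughput, so CPU
# utilization collapses as N grows. All numbers are illustrative, not AGLT2 data.

def cpu_utilization(n_jobs, cpu_s, disk_s):
    cpu_throughput = n_jobs / cpu_s   # tasks/s if CPU were the only limit
    disk_throughput = 1.0 / disk_s    # tasks/s the shared disk can sustain
    achieved = min(cpu_throughput, disk_throughput)
    return achieved * cpu_s / n_jobs  # fraction of available CPU time actually used

for n in (1, 4, 8, 16, 24):
    print(f"{n:2d} jobs: {cpu_utilization(n, cpu_s=100.0, disk_s=50.0):.0%} CPU use")
# 1 -> 100%, 4 -> 50%, 8 -> 25%, 16 -> 12%, 24 -> 8%: the same order of
# magnitude as the "under 10%" figure quoted above, by construction.
```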
Peter Solagna, EGI: This is a problem recently raised inside the operations community and now under discussion: the required disk throughput is rising with the number of cores. Unfortunately, so far, the available disk I/O speed is not published by the information system, nor is the technology (for example, local SSD or iSCSI). Nevertheless, this is a problem that could soon generate strong requirements for technology providers to consider.
iSGTW: So Shawn, are these machines available via OSG?
McKee: In our case some of the OSG services we support have SSDs underlying them. As I mentioned, our hope is that future worker nodes will have SSDs for local working areas. One likely scenario is that we will create alternate job queues for I/O-intensive jobs to use that will map onto SSD enabled worker nodes.
iSGTW: Does OSG have any plans for implementing the capability to match I/O-intensive jobs to SSD worker nodes?
Gardner: At the moment the focus is on scheduling to sites with high-throughput parallel computing (HTPC) capabilities, which means co-scheduling a large number of jobs onto a single machine. This is because many sites now have dual six-core processors which can be run in hyperthreaded mode, two threads per core, for 24 job slots on a machine. This has the advantage of optimizing memory use for certain applications as well as access to local storage for reading and writing job outputs.
As far as I understand it, HTPC is OSG's first foray into "whole node scheduling." The group leading this is headed by Dan Fraser.
iSGTW: Dan, are there any plans for developing whole node scheduling for special cases other than HTPC?
Dan Fraser, OSG production coordinator: As Rob said, we are implementing the capability of advertising HTPC-enabled sites. Jobs can then be matched to these sites. One additional category we hope to implement is allowing sites to advertise GPU capabilities. But we have no plans to advertise things like SSDs at this time.
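Conceptually, this kind of capability-based matching amounts to filtering site advertisements against a job's required tags. The sketch below is purely schematic and does not represent OSG's actual information system or matchmaker; the site names and the SSD tag are hypothetical:

```python
# Schematic capability matching: sites advertise capability tags (e.g. "HTPC",
# "GPU", hypothetically "SSD"); a job is matched to any site advertising all of
# the tags it requires. This illustrates the concept only, not OSG's actual
# advertisement or matchmaking machinery. Site names are hypothetical.
SITE_ADVERTS = {
    "SiteA": {"HTPC"},
    "SiteB": {"HTPC", "GPU"},
    "SiteC": set(),  # advertises no special capabilities
}

def matching_sites(required, adverts=SITE_ADVERTS):
    """Return the sites whose advertised capabilities cover the job's requirements."""
    return [site for site, caps in adverts.items() if required <= caps]

print(matching_sites({"HTPC"}))         # ['SiteA', 'SiteB']
print(matching_sites({"HTPC", "GPU"}))  # ['SiteB']
print(matching_sites({"SSD"}))          # [] - no site advertises SSDs yet
```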
iSGTW - http://www.isgtw.org/